Note to self: Use http://www.stat.cmu.edu/~cshalizi/rmarkdown/ to figure out math symbols!
In this exercise you’ll create some simulated data from a linear model of a continuous outcome and will fit simple regression models to them. Make sure to set a seed prior to starting part (a) to ensure consistent results. Use base R and the rms package.
set.seed(1984)
x <- rnorm(100, 0, 1)
set.seed(1776)
eps <- rnorm(100, 0, 0.25)
\(Y = -1 + 0.5X + \epsilon\)
y <- -1 + 0.5 * x + eps
plot(x, y)
The plot shows what appears to be an approximately linear relationship between x and y.
g <- rms::ols(y ~ x, x=T, y=T, se.fit=T)
g
## Linear Regression Model
##
## rms::ols(formula = y ~ x, x = T, y = T, se.fit = T)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 100 LR chi2 155.61 R2 0.789
## sigma0.2725 d.f. 1 R2 adj 0.787
## d.f. 98 Pr(> chi2) 0.0000 g 0.599
##
## Residuals
##
## Min 1Q Median 3Q Max
## -0.58362 -0.17994 -0.01112 0.18878 0.76363
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -0.9757 0.0273 -35.78 <0.0001
## x 0.5570 0.0291 19.15 <0.0001
##
anova(g)
## Analysis of Variance Response: y
##
## Factor d.f. Partial SS MS F P
## x 1 27.220947 27.22094654 366.54 <.0001
## REGRESSION 1 27.220947 27.22094654 366.54 <.0001
## ERROR 98 7.277929 0.07426458
g_dd <- datadist(x, y)
options(datadist="g_dd")
summary(g)
## Effects Response : y
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## x -0.62964 0.80631 1.436 0.79983 0.041777 0.71692 0.88273
AIC(g)
## d.f.
## 27.75532
hist(g$residuals, breaks=14)
plot(g$residuals, x)
plot(x, y)
abline(g)
The least-squares line passes through the center of the roughly linear cloud of points. By visual inspection, some points lie close to the line, but most are spread evenly around it.
As the residual plot shows, the values deviate from the regression line by an apparently random amount (there is no visible pattern), ranging from about -0.6 to 0.8.
x2 = x^2
p <- rms::ols(y ~ x + x2, x=T, y=T, se.fit=T)
p
## Linear Regression Model
##
## rms::ols(formula = y ~ x + x2, x = T, y = T, se.fit = T)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 100 LR chi2 158.02 R2 0.794
## sigma0.2706 d.f. 2 R2 adj 0.790
## d.f. 97 Pr(> chi2) 0.0000 g 0.602
##
## Residuals
##
## Min 1Q Median 3Q Max
## -0.5991 -0.1812 0.0140 0.1976 0.7676
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -0.9419 0.0349 -27.03 <0.0001
## x 0.5655 0.0294 19.23 <0.0001
## x2 -0.0388 0.0252 -1.54 0.1270
##
anova(p)
## Analysis of Variance Response: y
##
## Factor d.f. Partial SS MS F P
## x 1 27.0724905 27.0724905 369.63 <.0001
## x2 1 0.1735035 0.1735035 2.37 0.127
## REGRESSION 2 27.3944500 13.6972250 187.01 <.0001
## ERROR 97 7.1044253 0.0732415
p_dd <- datadist(x, x2, y)
options(datadist="p_dd")
summary(p)
## Effects Response : y
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## x -0.62964 0.80631 1.4360 0.81200 0.042235 0.72817 0.895820
## x2 0.13760 1.33190 1.1943 -0.04628 0.030069 -0.10596 0.013398
AIC(p)
## d.f.
## 27.34248
#plot(p$residuals, x)
hist(residuals(p), breaks=14)
plot(fitted(p),residuals(p))
plot(x, y)
y_pred <- predict(p, data.frame(x, x2))
lines(sort(x), y_pred[order(x)], col = "red")
The p value for the x^2 term is 0.1270, so it does not meet the standard 0.05 threshold. The coefficient for the x^2 term is also close to 0, so it has little effect on the fitted line. Visually, the quadratic fit is not much better than the linear fit, and the residual histograms are similar between the two models. Despite the large p value, small coefficient, and similar residuals, the AIC for the polynomial model is slightly better than the AIC for the linear model (though neither is very good).
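The two nested fits can also be compared head-to-head with a partial F test and by AIC. A minimal sketch in base R (the names `fit_lin` and `fit_quad` are hypothetical; the data are regenerated with the same seeds used above):

```r
# Regenerate the simulated data with the same seeds as above
set.seed(1984); x <- rnorm(100, 0, 1)
set.seed(1776); eps <- rnorm(100, 0, 0.25)
y <- -1 + 0.5 * x + eps

fit_lin  <- lm(y ~ x)           # linear model
fit_quad <- lm(y ~ x + I(x^2))  # adds the quadratic term
anova(fit_lin, fit_quad)        # partial F test for the x^2 term
AIC(fit_lin, fit_quad)          # lower AIC is preferred
```

The partial F test here is equivalent to the t test on the x^2 coefficient reported by ols, so it should give the same p value.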
set.seed(1984)
x_ln <- rnorm(100, 0, 1)
set.seed(1776)
eps_ln <- rnorm(100, 0, 0.125)
y_ln <- -1 + 0.5 * x_ln + eps_ln
plot(x_ln, y_ln)
g_ln <- rms::ols(y_ln ~ x_ln, x=T, y=T, se.fit=T)
g_ln
## Linear Regression Model
##
## rms::ols(formula = y_ln ~ x_ln, x = T, y = T, se.fit = T)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 100 LR chi2 267.20 R2 0.931
## sigma0.1363 d.f. 1 R2 adj 0.930
## d.f. 98 Pr(> chi2) 0.0000 g 0.568
##
## Residuals
##
## Min 1Q Median 3Q Max
## -0.291808 -0.089971 -0.005558 0.094390 0.381813
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -0.9878 0.0136 -72.46 <0.0001
## x_ln 0.5285 0.0145 36.33 <0.0001
##
anova(g_ln)
## Analysis of Variance Response: y_ln
##
## Factor d.f. Partial SS MS F P
## x_ln 1 24.506556 24.50655555 1319.96 <.0001
## REGRESSION 1 24.506556 24.50655555 1319.96 <.0001
## ERROR 98 1.819482 0.01856614
g_ln_dd <- datadist(x_ln, y_ln)
options(datadist="g_ln_dd")
summary(g_ln)
## Effects Response : y_ln
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## x_ln -0.62964 0.80631 1.436 0.7589 0.020888 0.71745 0.80036
print('AIC:')
## [1] "AIC:"
AIC(g_ln)
## d.f.
## -110.8741
hist(g_ln$residuals, breaks=14)
plot(g_ln$residuals, x_ln)
plot(x_ln, y_ln)
abline(g_ln, col='red')
x2_ln = x_ln^2
p_ln <- rms::ols(y_ln ~ x_ln + x2_ln, x=T, y=T, se.fit=T)
p_ln
## Linear Regression Model
##
## rms::ols(formula = y_ln ~ x_ln + x2_ln, x = T, y = T, se.fit = T)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 100 LR chi2 269.61 R2 0.933
## sigma0.1353 d.f. 2 R2 adj 0.931
## d.f. 97 Pr(> chi2) 0.0000 g 0.569
##
## Residuals
##
## Min 1Q Median 3Q Max
## -0.299550 -0.090579 0.006998 0.098804 0.383779
##
##
## Coef S.E. t Pr(>|t|)
## Intercept -0.9710 0.0174 -55.72 <0.0001
## x_ln 0.5327 0.0147 36.23 <0.0001
## x2_ln -0.0194 0.0126 -1.54 0.1270
##
anova(p_ln)
## Analysis of Variance Response: y_ln
##
## Factor d.f. Partial SS MS F P
## x_ln 1 24.02860017 24.02860017 1312.29 <.0001
## x2_ln 1 0.04337587 0.04337587 2.37 0.127
## REGRESSION 2 24.54993143 12.27496571 670.38 <.0001
## ERROR 97 1.77610633 0.01831037
p_ln_dd <- datadist(x_ln, x2_ln, y_ln)
options(datadist="p_ln_dd")
summary(p_ln)
## Effects Response : y_ln
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## x_ln -0.62964 0.80631 1.4360 0.76499 0.021117 0.723080 0.8069000
## x2_ln 0.13760 1.33190 1.1943 -0.02314 0.015034 -0.052979 0.0066992
print('AIC:')
## [1] "AIC:"
AIC(p_ln)
## d.f.
## -111.287
#plot(p$residuals, x)
hist(residuals(p_ln), breaks=14)
plot(fitted(p_ln),residuals(p_ln))
plot(x_ln, y_ln)
y_pred <- predict(p_ln, data.frame(x_ln, x2_ln))
lines(sort(x_ln), y_pred[order(x_ln)], col = "red")
As with the noisier data set shown previously, the p value for the x^2 term does not meet the standard 0.05 threshold. In this case, the coefficient for the x^2 term is even closer to 0 than before.
A visual comparison of the fitted lines and residual histograms shows that the linear and polynomial models are still very similar.
For this less noisy data set, the AIC for the polynomial model is still better than that of the linear model, though again the two are very close. Compared with the original, noisier data set, the AICs for both models have improved.
q2df <- data.frame(x, y, x2, eps)
ggplot(q2df, aes(x=x, y=y)) +
geom_point() +
stat_smooth(method = 'lm', formula = y ~ x)
ggplot(q2df, aes(x=x, y=y)) +
geom_point() +
stat_smooth(method = 'lm', formula = y ~ poly(x,2))
q2df_ln <- data.frame(x_ln, y_ln, x2_ln, eps_ln)
ggplot(q2df_ln, aes(x=x_ln, y=y_ln)) +
geom_point() +
stat_smooth(method = 'lm', formula = y ~ x)
ggplot(q2df_ln, aes(x=x_ln, y=y_ln)) +
geom_point() +
stat_smooth(method = 'lm', formula = y ~ poly(x,2))
As shown by the 95% confidence band around the fitted line (linear or polynomial), there is some uncertainty in the location of the prediction. The band is widest at the ends of the fitted line. Comparing the original and less noisy data sets, the less noisy one has a narrower confidence band. If the data were noisier, I would expect a wider confidence band around the regression line.
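The widening of the band at the ends follows from the standard error of the fitted mean, which grows with distance from mean(x). A small sketch in base R (regenerating the same simulated data; `grid` is a hypothetical name):

```r
# Regenerate the simulated data with the same seeds as above
set.seed(1984); x <- rnorm(100, 0, 1)
set.seed(1776); eps <- rnorm(100, 0, 0.25)
y <- -1 + 0.5 * x + eps

fit  <- lm(y ~ x)
# Standard error of the fitted mean at 5 points spanning the range of x;
# it is smallest near mean(x) and largest at the extremes, which is why
# the confidence band flares at the ends.
grid <- data.frame(x = seq(min(x), max(x), length.out = 5))
round(predict(fit, grid, se.fit = TRUE)$se.fit, 4)
```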
# had to modify call of rms::ols and add x=TRUE and y=TRUE
print('validate g')
## [1] "validate g"
validate(g)
## index.orig training test optimism index.corrected n
## R-square 0.7890 0.7926 0.7860 0.0067 0.7824 40
## MSE 0.0728 0.0697 0.0738 -0.0042 0.0769 40
## g 0.5987 0.5932 0.5987 -0.0056 0.6043 40
## Intercept 0.0000 0.0000 -0.0140 0.0140 -0.0140 40
## Slope 1.0000 1.0000 0.9908 0.0092 0.9908 40
print('validate p')
## [1] "validate p"
validate(p)
## index.orig training test optimism index.corrected n
## R-square 0.7941 0.7916 0.7889 0.0026 0.7914 40
## MSE 0.0710 0.0703 0.0728 -0.0026 0.0736 40
## g 0.6015 0.5951 0.6004 -0.0053 0.6068 40
## Intercept 0.0000 0.0000 0.0108 -0.0108 0.0108 40
## Slope 1.0000 1.0000 1.0111 -0.0111 1.0111 40
According to the rms::validate.ols documentation, “The validate function does resampling validation of a multiple linear regression model… Uses resampling to estimate the optimism in various measures of predictive accuracy which include R^2”. Alternatively, a bootstrap distribution of R^2 could be obtained by sampling the data with replacement, refitting the model on each resample, and recording the resulting R^2 values.
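A minimal sketch of that bootstrap in base R (regenerating the same simulated data; `r2_boot` is a hypothetical name):

```r
# Regenerate the simulated data with the same seeds as above
set.seed(1984); x <- rnorm(100, 0, 1)
set.seed(1776); eps <- rnorm(100, 0, 0.25)
y <- -1 + 0.5 * x + eps

# Bootstrap distribution of R^2: resample (x, y) pairs with replacement
# and refit the simple linear model each time.
set.seed(42)
r2_boot <- replicate(1000, {
  i <- sample(length(x), replace = TRUE)
  summary(lm(y[i] ~ x[i]))$r.squared
})
quantile(r2_boot, c(0.025, 0.5, 0.975))  # percentile interval for R^2
```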
An extensive data set was used to develop a prognostic model for 6-month unfavorable outcome (d.unfav) after moderate and severe TBI (d.gos = 3, 4). The strongest predictors were age, motor score and pupil reactivity. See “TBI Background Document.pdf” (in Canvas under Files -> Data) for the data dictionary. Note: Truncated systolic blood pressure (d.sysbt) is also included in the dataset but does not appear on the list in the document. The data are from Steyerberg and are in SPSS format. The dataset can be found in Canvas under Files -> Data -> TBI.sav.
wd <- getwd()
setwd("..")
parent <- getwd()
tbi <- read.spss(file = paste(parent,'TBI.sav', sep='/'), to.data.frame = TRUE)
## re-encoding from CP1252
print(paste(parent,'TBI.sav', sep='/'))
## [1] "/Users/seth/OneDrive - The University of Colorado Denver/Documents/BIOS_6645/TBI.sav"
setwd(wd)
# filter dataset
tbi = tbi[tbi$trial == 'Tirilazad US', ]
# convert d.motor column to factor
tbi$d.motor <- as.factor(tbi$d.motor)
LRM model using the built-in factor method to create dummy variables
tbi_model <- lrm(d.unfav ~ age + d.motor + pupil.i, data=tbi, x=TRUE, y=TRUE)
tbi_model
## Logistic Regression Model
##
## lrm(formula = d.unfav ~ age + d.motor + pupil.i, data = tbi,
## x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 1041 LR chi2 262.94 R2 0.304 C 0.784
## 0 646 d.f. 8 g 1.346 Dxy 0.567
## 1 395 Pr(> chi2) <0.0001 gr 3.841 gamma 0.569
## max |deriv| 2e-11 gp 0.266 tau-a 0.267
## Brier 0.179
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -1.5203 0.7755 -1.96 0.0499
## age 0.0442 0.0060 7.33 <0.0001
## d.motor=2 0.4352 0.7625 0.57 0.5682
## d.motor=3 0.0698 0.7608 0.09 0.9269
## d.motor=4 -0.7078 0.7479 -0.95 0.3440
## d.motor=5 -1.4894 0.7498 -1.99 0.0470
## d.motor=6 -1.9808 0.8507 -2.33 0.0199
## pupil.i=one reactive 0.3042 0.2232 1.36 0.1729
## pupil.i=no reactive pupils 1.2298 0.1921 6.40 <0.0001
##
AIC(tbi_model)
## [1] 1137.076
tbi_dd <- datadist(tbi)
options(datadist="tbi_dd")
summary(tbi_model)
## Effects Response : d.unfav
##
## Factor Low High Diff. Effect
## age 23 41 18 0.79546
## Odds Ratio 23 41 18 2.21550
## d.motor - 1:5 5 1 NA 1.48940
## Odds Ratio 5 1 NA 4.43430
## d.motor - 2:5 5 2 NA 1.92460
## Odds Ratio 5 2 NA 6.85210
## d.motor - 3:5 5 3 NA 1.55910
## Odds Ratio 5 3 NA 4.75470
## d.motor - 4:5 5 4 NA 0.78155
## Odds Ratio 5 4 NA 2.18490
## d.motor - 6:5 5 6 NA -0.49142
## Odds Ratio 5 6 NA 0.61175
## pupil.i - one reactive:both reactive 1 2 NA 0.30416
## Odds Ratio 1 2 NA 1.35550
## pupil.i - no reactive pupils:both reactive 1 3 NA 1.22980
## Odds Ratio 1 3 NA 3.42050
## S.E. Lower 0.95 Upper 0.95
## 0.10859 0.582620 1.00830
## NA 1.790700 2.74090
## 0.74983 0.019721 2.95900
## NA 1.019900 19.27900
## 0.24740 1.439700 2.40950
## NA 4.219300 11.12800
## 0.23633 1.095900 2.02230
## NA 2.992000 7.55600
## 0.18096 0.426870 1.13620
## NA 1.532400 3.11500
## 0.43750 -1.348900 0.36605
## NA 0.259530 1.44200
## 0.22316 -0.133220 0.74154
## NA 0.875270 2.09920
## 0.19205 0.853370 1.60620
## NA 2.347500 4.98380
plot(anova(tbi_model), what='proportion chisq') # relative importance
plot(Predict(tbi_model, fun=plogis)) # predicted values
#
# penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25)
# mod1_pen <- update(mod1, penalty=penalty$penalty)
# effective.df(mod1_pen)
# mod1_pen
Alternative method with dummy encoding instead of factors
# can use dummies package to create dummy encoded variables
#install.packages('dummies')
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
d_tbi = dummy.data.frame(tbi, names = c("d.motor","pupil.i") , sep = "_")
# or model.matrix to create dummy encoded variables
#model.matrix(~tbi$d.motor)
d_model <- lrm(d.unfav ~ age + d.motor_1 + d.motor_2 + d.motor_3 + d.motor_4 + d.motor_6 + `pupil.i_no reactive pupils` + `pupil.i_one reactive`, data=d_tbi, x=TRUE, y=TRUE)
d_model
## Logistic Regression Model
##
## lrm(formula = d.unfav ~ age + d.motor_1 + d.motor_2 + d.motor_3 +
## d.motor_4 + d.motor_6 + `pupil.i_no reactive pupils` + `pupil.i_one reactive`,
## data = d_tbi, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 1041 LR chi2 262.94 R2 0.304 C 0.784
## 0 646 d.f. 8 g 1.346 Dxy 0.567
## 1 395 Pr(> chi2) <0.0001 gr 3.841 gamma 0.569
## max |deriv| 1e-11 gp 0.266 tau-a 0.267
## Brier 0.179
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -3.0097 0.2549 -11.81 <0.0001
## age 0.0442 0.0060 7.33 <0.0001
## d.motor_1 1.4894 0.7498 1.99 0.0470
## d.motor_2 1.9246 0.2474 7.78 <0.0001
## d.motor_3 1.5591 0.2363 6.60 <0.0001
## d.motor_4 0.7815 0.1810 4.32 <0.0001
## d.motor_6 -0.4914 0.4375 -1.12 0.2613
## pupil.i_no reactive pupils 1.2298 0.1921 6.40 <0.0001
## pupil.i_one reactive 0.3042 0.2232 1.36 0.1729
##
d_tbi_dd <- datadist(d_tbi)
options(datadist="d_tbi_dd")
summary(d_model)
## Effects Response : d.unfav
##
## Factor Low High Diff. Effect S.E. Lower 0.95
## age 23 41 18 0.79546 0.10859 0.582620
## Odds Ratio 23 41 18 2.21550 NA 1.790700
## d.motor_1 0 1 1 1.48940 0.74983 0.019721
## Odds Ratio 0 1 1 4.43430 NA 1.019900
## d.motor_2 0 1 1 1.92460 0.24740 1.439700
## Odds Ratio 0 1 1 6.85210 NA 4.219300
## d.motor_3 0 1 1 1.55910 0.23633 1.095900
## Odds Ratio 0 1 1 4.75470 NA 2.992000
## d.motor_4 0 1 1 0.78155 0.18096 0.426870
## Odds Ratio 0 1 1 2.18490 NA 1.532400
## d.motor_6 0 1 1 -0.49142 0.43750 -1.348900
## Odds Ratio 0 1 1 0.61175 NA 0.259530
## pupil.i_no reactive pupils 0 1 1 1.22980 0.19205 0.853370
## Odds Ratio 0 1 1 3.42050 NA 2.347500
## pupil.i_one reactive 0 1 1 0.30416 0.22316 -0.133220
## Odds Ratio 0 1 1 1.35550 NA 0.875270
## Upper 0.95
## 1.00830
## 2.74090
## 2.95900
## 19.27900
## 2.40950
## 11.12800
## 2.02230
## 7.55600
## 1.13620
## 3.11500
## 0.36605
## 1.44200
## 1.60620
## 4.98380
## 0.74154
## 2.09920
plot(anova(d_model), what='proportion chisq') # relative importance
rms::validate(d_model, method="boot", B=500) # bootstrapped validation
## singular information matrix in lrm.fit (rank= 8 ). Offending variable(s):
## d.motor_1
##
## Divergence or singularity in 1 samples
## index.orig training test optimism index.corrected n
## Dxy 0.5672 0.5728 0.5605 0.0123 0.5548 499
## R2 0.3037 0.3122 0.2951 0.0171 0.2866 499
## Intercept 0.0000 0.0000 -0.0154 0.0154 -0.0154 499
## Slope 1.0000 1.0000 0.9601 0.0399 0.9601 499
## Emax 0.0000 0.0000 0.0116 0.0116 0.0116 499
## D 0.2516 0.2600 0.2435 0.0165 0.2351 499
## U -0.0019 -0.0019 0.0007 -0.0026 0.0007 499
## Q 0.2535 0.2619 0.2428 0.0191 0.2345 499
## B 0.1792 0.1774 0.1810 -0.0036 0.1828 499
## g 1.3459 1.3814 1.3212 0.0602 1.2857 499
## gp 0.2663 0.2692 0.2618 0.0074 0.2589 499
my.calib <- rms::calibrate(d_model, method="boot", B=500) # model calibration
plot(my.calib, las=1)
##
## n=1041 Mean absolute error=0.005 Mean squared error=6e-05
## 0.9 Quantile of absolute error=0.014
#library(splines)
plot(smooth.spline(tbi$age, tbi$d.unfav, df=4), type="l", ylim=c(0,1))
points(tbi$age, tbi$d.unfav)
ggplot(tbi, aes(x=age, y=d.unfav)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam'
Plotting age against unfavorable outcome suggests that younger ages (from the minimum to roughly 25) have little to no effect on the probability of an unfavorable outcome. Beyond age 25 there is an increasing, approximately linear relationship between age and unfavorable outcome.
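One way to let the model capture that apparent change in slope would be a restricted cubic spline on age. A sketch, assuming the rms package and the filtered `tbi` data frame from above (`tbi_rcs` is a hypothetical name; not run here):

```r
library(rms)

# Replace the linear age term with a 4-knot restricted cubic spline
tbi_rcs <- lrm(d.unfav ~ rcs(age, 4) + d.motor + pupil.i,
               data = tbi, x = TRUE, y = TRUE)
anova(tbi_rcs)                   # includes a test of the nonlinear age components
c(AIC(tbi_model), AIC(tbi_rcs))  # compare with the linear-age model
```

If the nonlinear components in the anova table are not significant and the AIC does not improve, the linear age term is probably adequate.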
tbi_dd <- datadist(tbi)
options(datadist="tbi_dd")
ggplot(Predict(tbi_model, age, d.motor, pupil.i, fun=plogis))
ggplot(Predict(tbi_model, age, pupil.i, d.motor, fun=plogis))
y_prob <- predict(tbi_model, tbi, type="fitted")
plot(tbi$d.unfav, y_prob)
noquote((c('Fraction of unfavorable outcomes in dataset: ', sum(tbi$d.unfav)/nrow(tbi))))
## [1] Fraction of unfavorable outcomes in dataset:
## [2] 0.379442843419789
# start with the naive threshold of 0.5; predict unfavorable (1) when the
# fitted probability exceeds the threshold
y_hat <- ifelse(y_prob > 0.5, 1, 0)
noquote((c('Predicted fraction of unfavorable outcomes in dataset: ', sum(y_hat)/nrow(tbi))))
## [1] Predicted fraction of unfavorable outcomes in dataset:
## [2] 0.295869356388088
# use ROC curve to find better threshold
roc.tbi <- roc(tbi$d.unfav, y_prob)
## Best cutoff by closest to topleft method
coords.tbi <- coords(roc = roc.tbi , x= "best" , input = "threshold", best.method = "closest.topleft")
y_hat <- ifelse(y_prob > coords.tbi[1], 1, 0)
noquote((c('Alternate: Predicted fraction of unfavorable outcomes in dataset: ', sum(y_hat)/nrow(tbi))))
## [1] Alternate: Predicted fraction of unfavorable outcomes in dataset:
## [2] 0.470701248799231
rms::validate(tbi_model, method="boot", B=500) # bootstrapped validation
## index.orig training test optimism index.corrected n
## Dxy 0.5672 0.5749 0.5605 0.0144 0.5528 500
## R2 0.3037 0.3139 0.2949 0.0190 0.2847 500
## Intercept 0.0000 0.0000 -0.0201 0.0201 -0.0201 500
## Slope 1.0000 1.0000 0.9563 0.0437 0.9563 500
## Emax 0.0000 0.0000 0.0132 0.0132 0.0132 500
## D 0.2516 0.2616 0.2433 0.0183 0.2334 500
## U -0.0019 -0.0019 0.0005 -0.0024 0.0005 500
## Q 0.2535 0.2635 0.2428 0.0207 0.2328 500
## B 0.1792 0.1771 0.1810 -0.0039 0.1831 500
## g 1.3459 1.3868 1.3205 0.0663 1.2796 500
## gp 0.2663 0.2701 0.2617 0.0084 0.2579 500
my.calib <- rms::calibrate(tbi_model, method="boot", B=500) # model calibration
plot(my.calib, las=1)
##
## n=1041 Mean absolute error=0.005 Mean squared error=5e-05
## 0.9 Quantile of absolute error=0.012
The fraction of unfavorable outcomes predicted at a fixed threshold does not match the observed fraction well. Although it is stated that the strongest predictors were age, motor score, and pupil reactivity, the p values obtained via rms::lrm are large for some of the individual terms (e.g., several d.motor levels and pupil.i = one reactive). A re-evaluation of the ‘strongest predictors’, or methods other than rms::lrm, might yield better results.
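A classification table makes the mismatch concrete. A sketch, assuming `tbi` and the fitted probabilities `y_prob` from above, and classifying probabilities above 0.5 as unfavorable (`y_hat_05` is a hypothetical name; not run here):

```r
# Cross-tabulate observed outcomes against thresholded predictions
y_hat_05 <- as.integer(y_prob > 0.5)
table(observed = tbi$d.unfav, predicted = y_hat_05)

# Sensitivity and specificity at this threshold
mean(y_hat_05[tbi$d.unfav == 1])      # sensitivity
mean(1 - y_hat_05[tbi$d.unfav == 0])  # specificity
```

Moving the threshold (e.g., to the ROC-derived cutoff) trades sensitivity against specificity, which is why the predicted fraction of unfavorable outcomes shifts with the cutoff.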
Run the same model on the international observations (g below). Alternatively, find a new dataset with internal results and run on that.